Abstract. The incorporation of unlabeled data in regression and
classification analysis is an increasing focus of the applied statistics and
machine learning literatures, with a number of recent examples
demonstrating the potential for unlabeled data to contribute to improved
predictive accuracy. The statistical basis for this semisupervised analysis
does not appear to have been well delineated; as a result, the underly- 
ing theory and rationale may be underappreciated, especially by non- 
statisticians. There is also room for statisticians to become more fully 
engaged in the vigorous research in this important area of intersection 
of the statistical and computer sciences. Much of the theoretical work 
in the literature has focused, for example, on geometric and structural 
properties of the unlabeled data in the context of particular algorithms, 
rather than probabilistic and statistical questions. This paper overviews 
the fundamental statistical foundations for predictive modeling and the 
general questions associated with unlabeled data, highlighting the rele- 
vance of venerable concepts of sampling design and prior specification. 
This theory, illustrated with a series of central illustrative examples 
and two substantial real data analyses, shows precisely when, why and 
how unlabeled data matter. 

Key words and phrases: Bayesian analysis, Bayesian kernel regression, 
latent factor models, mixture models, predictive distribution, semisu- 
pervised learning, unlabeled data. 

1. INTRODUCTION 

Recent interest in the use of so-called unlabeled 
data in problems of prediction in the machine learning
community has generated a growing awareness
of the potential for incorporation of ancillary design
data in classification and regression problems (Ben- 
nett and Demiriz, 1999; Blum and Mitchell, 1998; 
Joachims, 1999; Szummer and Jaakkola, 2002; Zhu, 
Ghahramani and Lafferty, 2003; Belkin, Niyogi and 
Sindhwani, 2004). This use of unlabeled data is often
referred to as semisupervised learning. Mainstream
probabilistic thinking is relatively underrepresented
in this active and exciting literature, and
the theoretical underpinnings of algorithms that ex- 
ploit unlabeled data have received scant attention 
from statistical scientists. Much of the activity is
algorithmic and applied. Machine learning examples
are typically presented case-by-case, with the
semisupervised analysis usually based on modifica- 
tions of (fully supervised) optimization algorithms 
for classification or regression prediction, and with 
the introduction of additional components of objective
functions that tie in unlabeled samples. These
additional components are justified by a combination
of structural and intuitive arguments, including, most
recently, asymptotic arguments on the convergence of
operators on manifolds
(Belkin and Niyogi, 2005; Coifman et al., 2005a, b).
There has been some work addressing the theoretical 
aspects of unlabeled data (Castelli and Cover, 1995; 
Seeger, 2000; Cozman and Cohen, 2002; Ando and 
Zhang, 2005) in specific contexts. In general, however,
the foundation and rationale for the relevance, and
likely effectiveness, of unlabeled data are still not
well understood.

For currently active application areas and also 
to underlie growth and development of the unla- 
beled data methodology long-term, it is critical that 
the underlying theoretical basis for the use of unla- 
beled data is delineated and more broadly under- 
stood among statistical and computational scien- 
tists. Our goal here is to promote broader awareness 
and interest among statisticians of the nature and 
importance of this area. We do this by outlining the 
conceptual and theoretical bases for the "when, why 
and how" in regard to the use of unlabeled data, and 
through a complementary series of illustrations in 
central statistical modeling contexts as well as em- 
pirical examples in two substantive data analyses. 

Beginning in Section 2 with an articulation of the 
basic model framework and discussion of fundamen- 
tal issues of sampling and design, we discuss the un- 
derlying conceptual and theoretical basis for using 
unlabeled data. This is developed in the Bayesian 
framework for prediction, in which implications for 
the incorporation, or otherwise, of unlabeled data 
in prediction problems become transparent. Section
3 provides concrete, illuminating examples in
a series of common statistical models. This includes 
examples in regression, prediction using multivari- 
ate normal mixture models, and standard mixture- 
based classification and discrimination. These are 
key contexts that connect intimately with some of 
the major areas of interest in machine learning, and 
contexts in which the relevance of unlabeled data is 
perhaps most transparent and intuitive. These examples
serve to highlight the relevance of unlabeled
data in standard, central areas of statistics. Section
4 overviews and exemplifies the issues in a class
of latent factor regression models, with an empirical 
illustration in analysis of a benchmark data set of 
handwritten digit classification. Section 5 concerns 
our final important context, that of kernel regres- 
sion; here we link statistical and machine learning 
approaches, illustrate the theoretical basis for the 
use of unlabeled data and provide a further empir- 
ical study in discrimination of cancer and normal 
tissue samples based on gene expression data. We 
close with summary comments in Section 6. 

2. GENERAL FRAMEWORK 
2.1 Context, Goals and Models 

Interest lies in aspects of the joint distribution of
two random quantities, $(y,x)$, and the core prediction
problem concerns statements about future values
of $y$ based on observing the corresponding $x$.
Both $x$ and $y$ may be multivariate, in general. In
standard regression problems, $y$ is a continuous or
discrete univariate response; in problems of classification,
$y$ is discrete, often binary. Using $p(\cdot)$ as
generic notation for probability density functions,
all inference problems require understanding aspects
of the joint density $p(y,x|\star)$, where $\star$ denotes all
parameters, to be described below in context, that
are needed to fully specify the joint density.

The fundamental problem of prediction — whether 
it be couched in terms of regression estimation or 
classification — is framed as follows: at a specified 
"future" value of x, make statements about the cor- 
responding value of y. Using * to denote future val- 
ues of interest, this implies a directional focus: we 
want to understand and evaluate, or estimate, 
$p(y_*|x_*,D)$ based on all available data and information
$D$.

Statistical models structure the problem in terms 
of parameters (which may be infinite-dimensional 
in nonparametric models) that represent all uncer- 
tain aspects of the joint probability distribution for 
$(y,x)$. By way of notation, the dominant and generally
(our) preferred specification of the joint density
is

$$p(y,x|\phi,\theta) = p(y|x,\phi)\,p(x|\theta), \tag{1}$$

where the functional forms of the two densities on
the right-hand side are completely specified by the
characterizing parameters $(\phi,\theta)$. The parameters $\phi$
and $\theta$ relate explicitly to the conditional for $y$ given
$x$ and then the marginal for $x$, respectively. Though
$\phi$ and $\theta$ are two distinct symbols in notation, they
can be structurally dependent in various ways, as we
will see later. From this joint density, we can also
deduce the implied marginal density for $y$, $p(y|\phi,\theta)$,
and the implied conditional density $p(x|y,\phi,\theta)$ via
the complementary factorization

$$p(y,x|\phi,\theta) = p(x|y,\phi,\theta)\,p(y|\phi,\theta), \tag{2}$$

where the full set of parameters $(\phi,\theta)$ may be involved,
in complicated ways, in the "retrospective"
conditional for $x$ given $y$, and the corresponding
marginal for $y$.

The conditional density of y given x is essential for 
prediction, of course, and hence we center our devel- 
opment on the representation (1), in the knowledge 
that we can move interchangeably between factor- 
izations (1) and (2) as desired. 

2.2 Sampling Designs 

The stochastic model of the data generation process,
referred to as the sampling design, leads to likelihood
functions as summaries of the data-based information
on $(\phi,\theta)$. Typical sampling contexts fall
into the following categories:

1. Data from the margins:

• $Y^m = \{y_i^m, i = 1 : k_m\}$ where the $y_i^m \sim p(y|\phi,\theta)$
are independent, and/or

• $X^m = \{x_i^m, i = (k_m + 1) : (k_m + n_m)\}$ where the
$x_i^m \sim p(x|\theta)$ are independent,

and with $Y^m$ and $X^m$ independent given $(\phi,\theta)$.
Having the opportunity to observe data $Y^m$ provides
information on aspects of the full set of
parameters $(\phi,\theta)$, while $X^m$ informs on aspects
of $\theta$ alone. $X^m$ is the traditional unlabeled data,
though the same term could also be applied to $Y^m$.

2. Full prospective random sampling in which
$(Y^p,X^p) = \{(y_i^p,x_i^p); i = 1 : n_p\}$ are drawn from the
full joint distribution $p(y,x|\phi,\theta)$. Here data are
paired and provide information on both $\theta$ and $\phi$.
This is a common classification and/or regression
design.

3. Data from a prospective design in which the
$X^p = \{x_i^p, i = 1 : n_p\}$ values above are specified in advance
by design. Then $X^p$ contains no information
about the parameters and we learn about the
parameter $\phi$ (only) through the likelihood based
on $(Y^p,X^p)$ that is the product of components
$p(y_i^p|x_i^p,\phi)$; this is the venerable and perhaps the
most common regression design in applied statistics.
In machine learning the term transductive
framework, outlined by Vapnik (1998), has been
applied in this setting when the objective is to
make predictions of $y$ on only some specific,
prespecified values of $x$.
4. Data from a typical retrospective design, or case-control
design, in which we observe the outcomes
$X^r = \{x_i^r, i = 1 : n_r\}$ at a chosen set of $y$ values
$Y^r = \{y_i^r, i = 1 : n_r\}$. Here too the data are paired,
but $Y^r$ provides no information about $(\phi,\theta)$ since
the $y$ values are chosen by design. The data in $X^r$
comprise a set of independent random draws
from $p(x|y,\phi,\theta)$ and therefore provide information
about $(\phi,\theta)$.

The difference between "prospective" and "retro- 
spective" is whether the observed y values are ran- 
dom or not. Since most examples we will discuss 
come from a prospective design, for notational simplicity
we will drop the superscript and use $(Y,X)$
to denote $(Y^p,X^p)$. Other sampling schemes arise
in statistical design (e.g., matched case-control de- 
signs, repeated measurement designs), but the above 
examples are key and central to much of predictive 
modeling and to our main goals of explicating the 
use of unlabeled data. Finally we note that the machine
learning community has used the term "sampling"
for a somewhat different use, applying it to
different factorizations of the joint distribution assuming
the data were generated by a full random
sampling of $(y,x)$; in that usage, the form (1) is referred
to as "diagnostic sampling" and (2) is referred
to as "generative sampling" (Cozman and Cohen, 
2002). 

2.3 Prediction 

We observe data D generated via one or a com- 
bination of the sampling designs mentioned above. 
We aim to predict (estimate, classify) a new case
at a value $x_*$. The prediction problem is solved from
the Bayesian perspective by evaluating the posterior
predictive distribution

$$p(y_*|x_*,D) = \int\!\!\int p(y_*|x_*,\phi)\,p(\phi,\theta|x_*,D)\,d\phi\,d\theta \tag{3}$$

at the value of the future $x_*$, where $p(\phi,\theta|x_*,D)$ is
the posterior distribution of the parameters given
the data and $x_*$. This posterior predictive distribution
is the relevant quantity whether $x_*$ is a random
draw from $p(x|\theta)$ or is specified directly. In
the former case $x_*$ arises as a sample from $p(x|\theta)$
and so provides additional information about $\theta$; then
$p(\phi,\theta|x_*,D)$ depends on $x_*$. In the latter case $x_*$ is
chosen at a value of interest, often one of a range
of values where we aim to explore potential future
outcomes, and so provides no additional information;
then

$$p(\phi,\theta|x_*,D) = p(\phi,\theta|D). \tag{4}$$

In any example it is important to be aware of the 
distinction but, for our development, it is a side issue 
and we assume the latter case (4) as it simplifies the 
notation. 

Our interest focuses on how $X^m$ enters in the evaluation
of the predictive density in (3). All forms of
information enter through $D$, so for $X^m$ (and any
other information) to be relevant in prediction it is
necessary that it play a role in defining the posterior
$p(\phi,\theta|D)$. This is the key to understanding if, and
how, any information in $D$ impacts the prediction
problem.

A relatively general framework has observations
on each of $Y^m$, $X^m$, $(Y,X)$ and $(Y^r,X^r)$. Then
Bayes' theorem under a specified prior $p(\phi,\theta)$ yields

$$p(\phi,\theta|D) \propto p(\phi,\theta)\,p(D|\phi,\theta)$$

with

$$p(D|\phi,\theta) = p(Y,X|\phi,\theta)\,p(X^m|\theta)\,p(Y^m|\phi,\theta)\,p(X^r|Y^r,\phi,\theta).$$

This posterior will depend in complicated ways on
all aspects of $D$, including aspects of the unlabeled
data $X^m$. Investigating this dependence is the key
to understanding the relevance and specific potential
uses of unlabeled data.

2.4 Common Framework of Regression 
and Classification 

For convenience and clarity, we start our discussion
in the simple regression/classification context
where data arise from a joint random sample
$D = (X,Y)$. Then

$$p(\phi,\theta|D) \propto p(\phi,\theta)\,p(Y|X,\phi)\,p(X|\theta).$$

For example, we may have a linear or nonlinear regression
model for $(y|x,\phi)$ in which $\phi$ represents the
uncertain regression parameters or regression functions.



Now imagine that we have the opportunity to additionally
observe or measure some unlabeled data
$X^m$. The modified posterior with $D = \{Y,X,X^m\}$
is then

$$p(\phi,\theta|D) \propto p(\phi,\theta)\,p(Y|X,\phi)\,p(X|\theta)\,p(X^m|\theta).$$

If $\phi$ and $\theta$ are independent under the prior,
$p(\phi,\theta) = p(\phi)p(\theta)$, then

$$p(\phi,\theta|D) = p(\phi|Y,X)\,p(\theta|X^m,X).$$

Thus, prior independence leads to posterior independence
and the unlabeled data $X^m$ is irrelevant
in learning about $\phi$, and hence irrelevant in predicting
new $y_*$, if $\phi$ and $\theta$ are a priori independent. This
follows from

$$p(y_*|x_*,D) = \int\!\!\int p(y_*|x_*,\phi)\,p(\phi,\theta|D)\,d\phi\,d\theta
= \int p(y_*|x_*,\phi)\,p(\phi|Y,X)\,d\phi$$

by posterior independence.
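The factorization argument can be checked numerically on a small grid. The sketch below is only an illustration under assumptions that are not part of the paper: a toy model with $y|x,\phi \sim N(\phi x, 1)$ and $x|\theta \sim N(\theta, 1)$, a flat independent prior, and a hypothetical dependent prior that concentrates mass near $\phi = \theta$. The function and variable names are invented for this example.

```python
import numpy as np

def posterior_mean_phi(prior, y, x, x_unlabeled, grid):
    """Posterior mean of phi on a (phi, theta) grid for an assumed toy model.

    Toy model (an assumption for illustration): y | x, phi ~ N(phi * x, 1)
    and x | theta ~ N(theta, 1). prior[i, j] is the (strictly positive)
    prior mass at phi = grid[i], theta = grid[j].
    """
    x_all = np.concatenate([x, x_unlabeled])
    # log p(Y | X, phi): a function of phi only
    loglik_phi = np.array([-0.5 * np.sum((y - p * x) ** 2) for p in grid])
    # log p(X | theta) + log p(X^m | theta): a function of theta only
    loglik_theta = np.array([-0.5 * np.sum((x_all - t) ** 2) for t in grid])
    log_post = np.log(prior) + loglik_phi[:, None] + loglik_theta[None, :]
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return float(np.sum(post.sum(axis=1) * grid))

rng = np.random.default_rng(0)
grid = np.linspace(-3.0, 3.0, 121)
x = rng.normal(1.0, 1.0, size=5)            # labeled x values
y = 0.5 * x + rng.normal(size=5)            # their labels
x_m = rng.normal(1.0, 1.0, size=200)        # unlabeled sample X^m
no_x_m = np.array([])

indep_prior = np.ones((grid.size, grid.size))                         # p(phi)p(theta)
dep_prior = np.exp(-0.5 * (grid[:, None] - grid[None, :]) ** 2 / 0.1)  # phi near theta

def shift(prior):
    """Change in E(phi | D) caused by adding the unlabeled X^m to D."""
    return (posterior_mean_phi(prior, y, x, x_m, grid)
            - posterior_mean_phi(prior, y, x, no_x_m, grid))

print("independent prior shift:", shift(indep_prior))
print("dependent prior shift:  ", shift(dep_prior))
```

Under the independent prior the shift should vanish up to rounding error, while under the dependent prior the unlabeled sample moves the posterior for $\phi$, in line with the argument above.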

In other cases, the posterior for $(\theta,\phi)$ may, and
often will, involve dependencies. Therefore, additional
information generated from marginal data will
have an impact on the prediction problem via the
integration over the posterior that defines $p(y_*|x_*,D)$.
In the general framework, data from $Y^m$, $X^m$ and
$(Y^r,X^r)$ will all have an impact on the prediction
problem.

It is now evident that, beyond probabilistic/prior
dependencies in the Bayesian formulation, any structural
relationship between the "regression component"
parameters $\phi$ and the "x-marginal" component
parameters $\theta$ will also inevitably lead to dependence
of predictions on the unlabeled data. How
such dependencies arise and what forms they take
depend on context.

A standard and conceptually simple example is 
that of mixture modeling, in which learning about 
the marginal distribution for x informs on the rel- 
ative probabilities of mixture components for the 
joint distribution for (y, x), and hence influences predictions.
This idea-fixing example is developed below
as the first of a series of common modeling
contexts that illuminate the issues and theoretical 
framework. 

3. SOME CENTRAL MODELING CONTEXTS
AND EXAMPLES

3.1 Nonlinear Regression Prediction 
Using Mixtures 

A methodologically central example, and one in
which the relevance of unlabeled data is transparent,
is that of Gaussian mixture modeling for regression.
Consider the case of univariate $y$ and multivariate
$x$, with joint sampling density

$$p(y,x) = \sum_{i=1}^m \pi_i f_i(y,x), \tag{5}$$

where $0 < \pi_i < 1$, $\sum_i \pi_i = 1$ and the $f_i(y,x)$ are density
functions of distinct multivariate normal distributions.
Transforming the joint distribution (5) to
the prospective parametrization in (1), we have

$$p(x|\theta) = \sum_{i=1}^m \pi_i f_i(x),$$

$$p(y|x,\phi) = \sum_{i=1}^m w_i(x) f_i(y|x),$$

where $f_i(x)$ and $f_i(y|x)$ are the corresponding marginal
and conditional densities of the multivariate
normal $f_i(y,x)$, and

$$w_i(x) = \frac{\pi_i f_i(x)}{\sum_{j=1}^m \pi_j f_j(x)}$$

is the conditional mixing probability evaluated at
the conditioning $x$ value.

In terms of the general theory and notation of Section 2,
the simplest parameter specification has
$\phi = \{\pi,\alpha,\beta\}$ and $\theta = \{\pi,\alpha\}$, where $\pi = \{\pi_i : i = 1 : m\}$,
$\beta$ is the full set of linear regression coefficients,
intercepts and conditional variances in the set of $m$
normal linear models $f_i(y|x)$ and $\alpha$ is the full set
of mean vectors and variance matrices of the normal
distributions $f_i(x)$. This parametrization makes
clear the direct structural dependence of $\phi$ and $\theta$,
and hence the deductions from the general theory of
Section 2 that unlabeled data will matter in future
predictions. This conclusion is evident by inspection.
Observing data on the margin $x$ provides direct information
on the relative weights $\pi_i$ of the normal
components, and hence provides information relevant
to predicting future $y_*$ values. The unlabeled
data also of course inform about the other parameters
$\alpha$ of the margin for $x$, that is, the component
mean and covariance parameters of the marginal
normal mixture $p(x|\theta)$ as well as the component
weights. These parameters are also involved in the
calculations needed for prediction, through the conditional
mixing probabilities $w_i(x_*)$, and so the unlabeled
data play a more intricate role than just advising
on the weights.
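To make the role of the $x$-margin concrete, the following sketch evaluates the conditional mixing probabilities and the implied regression function for a bivariate normal mixture with known parameters. It is an illustration only: the function name, the ordering of each component as $(y, x)$ and the example parameter values are assumptions made here, and the conditional mean formula is the standard bivariate normal result.

```python
import numpy as np

def mixture_regression(x, weights, means, covs):
    """Conditional mixing probabilities and E(y | x) for a bivariate normal mixture.

    Assumed layout (for illustration): means[i] = (mu_y, mu_x) and covs[i] is
    the 2 x 2 covariance matrix of component i, ordered as (y, x).
    """
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    mu_y, mu_x = means[:, 0], means[:, 1]
    s_yx, s_xx = covs[:, 0, 1], covs[:, 1, 1]
    # Component densities f_i(x) of the x-margin
    f_x = np.exp(-0.5 * (x - mu_x) ** 2 / s_xx) / np.sqrt(2.0 * np.pi * s_xx)
    # Conditional mixing probabilities: weights times f_i(x), normalized
    w = weights * f_x
    w = w / w.sum()
    # Component-wise conditional means E_i(y | x) from the normal linear models
    cond_means = mu_y + s_yx / s_xx * (x - mu_x)
    return w, cond_means, float(w @ cond_means)

# Two symmetric components (hypothetical values)
example_weights = [0.5, 0.5]
example_means = [(-1.0, -1.0), (1.0, 1.0)]
example_covs = [np.eye(2), np.eye(2)]
w0, m0, reg0 = mixture_regression(0.0, example_weights, example_means, example_covs)
w5, m5, reg5 = mixture_regression(5.0, example_weights, example_means, example_covs)
print(w0, reg0)
print(w5, reg5)
```

Every quantity entering the weights belongs to the margin of $x$, which is why better knowledge of that margin from unlabeled data changes the regression function even when the component regressions are held fixed.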



An illustrative example. A simple illustrative example
fixes ideas and shows how unlabeled data may
increase predictive accuracy in nonlinear regression
via a mixture model. Consider two-dimensional data
$(y,x)$ modeled using a three-component Gaussian
mixture model in the above framework, setting
$f_i(y,x)$ to be the bivariate normal $N(\mu_i,\Sigma_i)$, for
each $i = 1 : 3$. From the following prior on model
parameters, we simulated one set of parameters and,
given those parameters, drew a sample of 175 observations
from the resulting three-component Gaussian
mixture. Figure 1(a1) is a scatter plot of the
175 observations.

The prior,

$$(\pi_1,\pi_2,\pi_3) \sim \mathrm{Dir}(1/3,1/3,1/3),$$

$$(\mu_i|\Sigma_i) \sim N(0,\tau\Sigma_i), \quad (i = 1,2,3) \text{ with } \tau = 0.2,$$

$$\Sigma_i \sim \mathrm{IW}(d,S_0), \quad (i = 1,2,3) \text{ with } d = 3 \text{ and } S_0 = (4/3)I_{2\times 2},$$

was used for posterior and predictive analysis of subsets
of this full data set. Here Dir denotes a Dirichlet
distribution and IW an inverse Wishart, in standard
notation.

The standard Gibbs sampler for mixture models
(Lavine and West, 1992; West, 1992) delivers Monte
Carlo approximations to posterior and predictive
distributions. In particular, given posterior samples
of all model parameters (including the latent mixture
component indicators for each sample), the posterior
mean of the regression curve can be approximated
pointwise over a range of values $x_* \in [a,b]$ to
deliver the estimated regression function $E(y_*|x_*,D)$
over this range. This is plotted in Figure 1(a1) for
the case in which $D$ is the full set of 175 observations
and $x_* \in [-2,2]$. The corresponding estimates
of the predictive density functions $p(y_*|x_*,D)$ are
plotted for three different values of $x_* \in \{-1.5, 0, 2\}$
in Figure 1(a2). These regression and density curves
can be viewed as the "gold standards" as they fully
utilize all the available data.

Assume now that we can only measure $y$ values
for $x$ in the range $[-1,1]$. This leaves us with a
smaller labeled data set and an unlabeled set $X^m$.
We can fit the model to the labeled data only, or
to the labeled and unlabeled data. The Gibbs sampler
can be easily extended to treat the $y$ values
for the unlabeled $X^m$ as missing data and draw the
corresponding labels. Such analysis results in Monte
Carlo estimates of regression curves and predictive
densities that can be compared to those from the
full data analysis: Figure 1(b) presents results using
only the labeled data and Figure 1(c) presents results
using both the labeled and the unlabeled data.
Comparison of these graphs with Figure 1(a) strikingly
illustrates the differences, and highlights the
improvements in prediction that can be obtained by
incorporating the unlabeled data.

Fig. 1. Mixture model regression and prediction example.
(a) Analysis using the full data set: (a1) scatter plot of all
175 data points together with the posterior predictive regression
curve of $(y|x)$ and three contours representing the posterior
estimates of the three Gaussian components; (a2) the estimated
predictive density functions $p(y|x)$ evaluated at three
chosen values $x = -1.5$, $0$ and $2$. (b) Analysis using only the
labeled data. (c) Analysis using the labeled data and also the
unlabeled points (open circles).

3.2 Classification and Discrimination 
with Mixtures 

A related and also methodologically central mixture
modeling context is that of classification and
discrimination, in which $x$ arises from a mixture
distribution and $y$ indicates the mixture component
(Lavine and West, 1992). For example, in binary
classification, for each of $y = 0,1$ the model is specified
with $\Pr(y = 1) = \pi$, $0 < \pi < 1$, and $(x|y) \sim f_y(x)$
for some parametrized densities $f_0$ and $f_1$.
In the common Gaussian mixture model, $f_0$ and $f_1$
are multivariate normal densities parametrized by
different means and variance matrices (Lavine and
West, 1992), $f_y(x) = N(x|\mu_y,\Sigma_y)$ for each $y = 0,1$.
Define $\psi = \{\mu_0,\mu_1,\Sigma_0,\Sigma_1\}$. Here the scientifically
natural specification of the joint distribution is via
the "retrospective" construction of (2), parametrized
as $p(x|y,\psi)$ and $p(y|\pi)$.

Relating to the factorization of (1), the implied
marginal for $x$ and conditional for $y$ given $x$ are easily
deduced; the former is the implied mixture of the
two normal distributions weighted by probabilities
$\pi$ and $1 - \pi$, and the latter is simply the revised
outcome "classification" probability for $y = 1$ at a
point $x$, computed by Bayes' theorem. It is transparent
that unlabeled data matter in this setting.
Observations from the marginal distribution for $x$
provide information about both the mixture weight
$\pi$ and the parameters $\psi$ of the component normals.
Prediction of a new $y_*$ at a point $x_*$ is performed
by estimating (whether by formal Bayesian computations
or otherwise) the classification probability
$\Pr(y_* = 1|x_*)$, which is a complicated function of all
parameters $\pi$ and $\psi$ and depends critically on various
nonlinear functions of $\psi$ in particular. Hence information
about $(\pi,\psi)$ from unlabeled data, as from
any other source, feeds through to impact on predictions.
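For fixed parameter values, the classification probability obtained from Bayes' theorem in the two-component Gaussian case is $\Pr(y = 1|x) = \pi f_1(x) / \{\pi f_1(x) + (1 - \pi) f_0(x)\}$. The sketch below evaluates it directly; the function names and the example parameter values are hypothetical, and a full analysis would average this quantity over the posterior for $(\pi,\psi)$ rather than plug in single values.

```python
import numpy as np

def normal_pdf(x, mean, cov):
    """Multivariate normal density N(x | mean, cov)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mean = np.atleast_1d(np.asarray(mean, dtype=float))
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    diff = x - mean
    quad = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** x.size * np.linalg.det(cov))
    return float(np.exp(-0.5 * quad) / norm)

def prob_y1(x, pi, mu0, cov0, mu1, cov1):
    """Pr(y = 1 | x) by Bayes' theorem for the two-component Gaussian model."""
    num = pi * normal_pdf(x, mu1, cov1)
    den = num + (1.0 - pi) * normal_pdf(x, mu0, cov0)
    return num / den

# Hypothetical parameter values: the point x is equidistant from both means,
# so the classification probability reduces to the mixture weight pi.
x_star = [0.0, 0.0]
p_half = prob_y1(x_star, 0.5, [-1.0, 0.0], np.eye(2), [1.0, 0.0], np.eye(2))
p_high = prob_y1(x_star, 0.8, [-1.0, 0.0], np.eye(2), [1.0, 0.0], np.eye(2))
print(p_half, p_high)
```

Both $\pi$ and the component parameters appear in the formula, so any sharpening of either from unlabeled $x$ values changes the classification probability.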

To connect with the general notation and theory
of Section 2, we see that for the primary parameters
$\pi,\psi$ there is no simple reduction or separation of the
parameters into distinct parameters $\phi$ and $\theta$. Each
of the distributions $p(y|x,\phi)$ and $p(x|\theta)$ depends in
complicated ways on all the parameters $\pi$ and $\psi$,
and consistency with the notation in Section 2 is
achieved only by setting $\phi = \theta = \{\pi,\psi\}$. Hence, in
this key example, $\phi$ and $\theta$ are fundamentally highly
structurally related, and the general theory of Section 2
implies that predictions will be impacted by
the use of unlabeled data $X^m$ in concordance with
our immediate, context-specific deductions above.

We note that various statistical and algorithmic
approaches have been proposed to take advantage
of the information in $X^m$, and the effectiveness of
$X^m$ has been either implicitly or explicitly well studied
methodologically (Ganesalingam and McLachlan,
1978; O'Neill, 1978; Ganesalingam and McLachlan,
1979; Müller, Erkanli and West, 1996; Nigam
et al., 2000). An interesting theoretical connection
in Castelli and Cover (1995) concerns the asymptotics
of prediction errors for $y_*$ with respect to an
increasingly large unlabeled sample, so that asymptotically
all parameters are effectively "known"; the
basic conclusion of this analysis was that labeled
samples are exponentially more valuable than unlabeled
samples in classification problems.

3.3 Normal Linear Regression Models

In the usual normal linear regression model,
$\phi = (\beta,\tau)$ is the set of regression parameters from the
model

$$y|x,\phi \sim N(\beta'x,\tau^2),$$

where $x$ and $\beta$ are $k$-dimensional vectors. One way
such a model can arise is from an assumed joint
multivariate normal distribution for $(y,x)$, namely,
the $(k+1)$-dimensional (zero-mean) normal $N(0,\Sigma)$
where

$$\Sigma = \begin{pmatrix} \sigma_y^2 & \rho' \\ \rho & \Sigma_x \end{pmatrix}$$

for some scalar parameter $\sigma_y$, $k$-dimensional vector
of covariance parameters $\rho$ and $k \times k$ variance matrix
$\Sigma_x$. Under such a model we have $\beta = \Sigma_x^{-1}\rho$ and
$\tau^2 = \sigma_y^2 - \beta'\rho$, and the marginal $p(x|\theta) = N(\mu_x,\Sigma_x)$ with
the characterizing parameter $\theta = (\mu_x,\Sigma_x)$.
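The map from the joint covariance matrix to the regression parameters can be sketched in a few lines. The code assumes the joint covariance of $(y,x)$ is stored with $y$ first, so that the first diagonal entry is the variance of $y$, the rest of the first column is the covariance vector $\rho$ and the lower-right block is the variance matrix of $x$; the function name and the numerical example are hypothetical.

```python
import numpy as np

def regression_from_joint(sigma):
    """Regression coefficients and residual variance implied by a joint normal.

    sigma is the (k + 1) x (k + 1) covariance matrix of (y, x), with y first
    (an assumed storage convention for this illustration).
    """
    sigma = np.asarray(sigma, dtype=float)
    sigma_y2 = sigma[0, 0]          # variance of y
    rho = sigma[1:, 0]              # covariances between x and y
    sigma_x = sigma[1:, 1:]         # variance matrix of x
    beta = np.linalg.solve(sigma_x, rho)   # beta = Sigma_x^{-1} rho
    tau2 = sigma_y2 - beta @ rho           # tau^2 = sigma_y^2 - beta' rho
    return beta, tau2

# Hypothetical example with k = 2 and uncorrelated, unit-variance predictors
joint = [[2.0, 0.5, 0.3],
         [0.5, 1.0, 0.0],
         [0.3, 0.0, 1.0]]
beta_hat, tau2_hat = regression_from_joint(joint)
print(beta_hat, tau2_hat)
```

Both outputs depend on the variance matrix of $x$, which is also what the margin of $x$ informs on; whether that shared dependence matters for prediction is decided by the prior, as the example contexts below discuss.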
Some example contexts are as follows: 

• A direct specification of the prior $p(\phi,\theta) = p(\phi)p(\theta)$
that assumes independence, and so implies
that unlabeled data $X^m$ will be irrelevant to prediction
of future $y_*$. This would be typical in many
applied regression settings.



• An indirect specification in which the initial prior
is defined for $(\mu,\Sigma)$, with the prior $p(\phi,\theta)$ being
implied by transformation. A common approach is
to use the conjugate normal-inverse Wishart prior
distribution. Any prior in this class has the property
that the implied prior on $(\phi,\theta)$ is in fact one
in which $\phi$ and $\theta$ are independent (Geiger and
Heckerman, 2002; Dobra et al., 2004).

• Other indirect specifications of the prior $p(\phi,\theta)$
by deduction from a prior on $\Sigma$ will induce dependence
between $\phi$ and $\theta$ and hence lead to relevance
of the unlabeled data since $X^m$ will then
provide information about $\phi$ indirectly through
its relevance for $\theta$.

The second example here illustrates a case in which
modeling prior information on parameters of the
joint distribution of $y$ and $x$ using a standard conjugate
prior implies that the unlabeled $X^m$ data will be
irrelevant for predicting $y_*$. This result arises more
generally in exponential family models. Other priors
may, and usually will, lead to prior and therefore
posterior dependence so the unlabeled data will be
relevant.

3.4 Binary Outcomes: Cancer Incidence 
and Prognosis 

An illuminating example is the case of binary y 
and binary x. For thematic context, suppose x = 1/0 
represents the presence/absence of mutation in the 
BRCA1 breast cancer gene in a woman, and that y =
1/0 represents occurrence of breast cancer before age
70. The goal here is to predict the probability of 
y = 1 given the presence or absence of the mutation. 

In this breast cancer example we define $\theta$ as the incidence
rate of the BRCA1 mutation; $\phi_0$ is the base
rate for breast cancer in the general (wild type) population
of women, and $\phi_1$ the (higher) cancer rate
among carriers of the mutation. The joint density
using $p(y|x,\phi)$ and $p(x|\theta)$ is then parametrized by
the three probabilities, $\phi = (\phi_0,\phi_1)$ and $\theta$ where

• $\phi_x = \Pr(y = 1|x,\phi)$ for $x \in \{0,1\}$ and

• $\theta = \Pr(x = 1|\theta)$.

For this model, the predictive distribution given the
data is

$$p_* = \Pr(y_* = 1|x_*,D) = \int\!\!\int \phi_{x_*}\,p(\phi,\theta|x_*,D)\,d\phi\,d\theta.$$

Given the above model, the following are two natural
prior specifications:

• We can directly specify independent priors on $\theta$
and $\phi$. As a result $p_*$ will not depend on the unlabeled
data.

• We have cell probabilities $p(x,y)$ on the joint space
$x = 0,1$, $y = 0,1$ defined as

$$\pi = \{\pi_{00},\pi_{01},\pi_{10},\pi_{11}\}.$$

Common approaches utilize Dirichlet priors on $\pi$.
If we choose a Dirichlet prior $p(\pi)$ and find the
implied prior $p(\phi,\theta)$ by transformation, the result
is prior independence of $\phi$ and $\theta$, and again the
unlabeled data are irrelevant to prediction.

A more interesting and perhaps natural modeling
assumption on the joint space is that the breast
cancer samples come from an inhomogeneous population
having two genetically and environmentally
different subpopulations in connection with inherited
breast cancer-related characteristics and lifetime
cancer risks. In this case a reasonable prior
would be a mixture of two Dirichlets,

$$p(\pi) = \alpha p_0(\pi) + (1 - \alpha)p_1(\pi),$$

where $p_0$ and $p_1$ are two different Dirichlet priors for
the two subpopulations, though the sampling design
cannot distinguish between the subpopulations. It
then follows by transformation that

$$p(\phi|\theta) = w(\theta)p_0(\phi) + (1 - w(\theta))p_1(\phi),$$

where $p_0$ and $p_1$ are the implied margins on $\phi$ from
each of the two Dirichlets, and the mixing probability
$w(\theta)$ is computed conditionally on any value of
$\theta$ as

$$\frac{w(\theta)}{1 - w(\theta)} = \frac{\alpha\,p_0(\theta)}{(1 - \alpha)\,p_1(\theta)},$$

with $p_0(\theta)$ and $p_1(\theta)$ the corresponding implied margins on $\theta$.

Thus under a mixture prior of this form, $\theta$ and $\phi$ are
dependent and so the unlabeled data $X^m$ will provide
information about $y_*$ indirectly via $\theta$ and $\phi$.
The dependence between $\phi$ and $\theta$ is reflected in the
variation of the weight $w(\theta)$ that provides the "link"
for the unlabeled data information to flow through
to impact on inferences about $\phi$, and hence to $y_*$.
Only in the extreme case of no subpopulation structure,
when $p_0(\cdot) = p_1(\cdot)$, will unlabeled data on
the mutational incidence rate play no role in predicting
cancer events for future patients.
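The effect of the mixture prior can be checked with a short conjugate calculation. The sketch below is an illustration under stated assumptions rather than the analysis used in the paper: it relies on the standard Dirichlet property that, within each component, $\theta$, $\phi_0$ and $\phi_1$ are independent Beta variables, assumes prospective random sampling for the labeled counts, and uses invented function names, prior parameters and counts.

```python
from math import exp, lgamma, log

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def posterior_mean_phi1(components, labeled, unlabeled=(0, 0)):
    """Posterior mean of phi_1 = Pr(y = 1 | x = 1) under a mixture of Dirichlets.

    components: list of (alpha_k, (a00, a01, a10, a11)), Dirichlet parameters
                indexed as a_xy (hypothetical values in the example below).
    labeled:    counts (n00, n01, n10, n11), indexed as n_xy.
    unlabeled:  counts (m0, m1) of x = 0 and x = 1 observed without y.
    """
    n00, n01, n10, n11 = labeled
    m0, m1 = unlabeled
    log_weights, means = [], []
    for alpha, (a00, a01, a10, a11) in components:
        # Within a component: theta ~ Beta(a10 + a11, a00 + a01),
        # phi_0 ~ Beta(a01, a00), phi_1 ~ Beta(a11, a10), independently.
        t1, t0 = a10 + a11, a00 + a01
        lw = log(alpha)
        # Marginal likelihood factor for theta: the only place m0, m1 enter
        lw += log_beta(t1 + n10 + n11 + m1, t0 + n00 + n01 + m0) - log_beta(t1, t0)
        lw += log_beta(a01 + n01, a00 + n00) - log_beta(a01, a00)
        lw += log_beta(a11 + n11, a10 + n10) - log_beta(a11, a10)
        log_weights.append(lw)
        means.append((a11 + n11) / (a11 + a10 + n11 + n10))
    top = max(log_weights)
    weights = [exp(lw - top) for lw in log_weights]
    return sum(w * mu for w, mu in zip(weights, means)) / sum(weights)

two_subpops = [(0.5, (8.0, 1.0, 0.5, 0.5)), (0.5, (4.0, 1.0, 1.0, 4.0))]
no_structure = [(0.5, (4.0, 1.0, 1.0, 4.0)), (0.5, (4.0, 1.0, 1.0, 4.0))]
labeled_counts = (10, 1, 2, 3)
unlabeled_counts = (90, 10)

for name, prior in [("two subpopulations", two_subpops), ("no structure", no_structure)]:
    print(name,
          posterior_mean_phi1(prior, labeled_counts),
          posterior_mean_phi1(prior, labeled_counts, unlabeled_counts))
```

The unlabeled counts enter only through the component weights, so they alter the posterior mean of $\phi_1$ when the two Dirichlet components differ and leave it unchanged when they coincide.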



4. FACTOR MODELS AND FACTOR 
REGRESSION 

4.1 Statistical Framework 

The interest in factor regression has increased due
to the prevalence of problems with high-dimensional
predictors. One common example is principal component
regression (PCR). In PCR, the singular value
decomposition of the design matrix of original predictor
variables generates principal components, or
empirical factors, that become the predictors in a
regression. The resulting orthogonal regression and
potential data reduction are two key benefits of this
modeling approach. However, a key question is raised
in connection with prediction: since we aim to predict
$y_*$ values at new, future $x_*$ values, should we
not include the future design points in the initial
analysis and principal component evaluation? This
is evidently just a question of whether, and if so,
how, to use unlabeled data in the model development
and analysis of existing labeled data.
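The question can be made concrete with a small sketch of PCR in which the principal directions are computed either from the labeled design points alone or from the labeled points stacked with unlabeled and future design points. This is an illustrative least-squares version, not the Bayesian analysis developed in this section; the function names, the simulated data and the choice of two components are all assumptions made for the example.

```python
import numpy as np

def principal_directions(x, k):
    """Top-k right singular vectors of the column-centered matrix x (n x p)."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T  # p x k

def pcr_fit_predict(x_lab, y_lab, x_new, x_for_pcs, k):
    """Least-squares PCR: directions from x_for_pcs, regression fit on labeled data."""
    center = x_for_pcs.mean(axis=0)
    v = principal_directions(x_for_pcs, k)
    z_lab = (x_lab - center) @ v          # empirical factors for labeled cases
    z_new = (x_new - center) @ v          # empirical factors for future cases
    design = np.column_stack([np.ones(len(z_lab)), z_lab])
    coef, *_ = np.linalg.lstsq(design, y_lab, rcond=None)
    return np.column_stack([np.ones(len(z_new)), z_new]) @ coef

# Simulated data (hypothetical): p = 10 predictors driven by k = 2 factors
rng = np.random.default_rng(1)
p, k = 10, 2
loadings = rng.normal(size=(p, k))

def draw_x(n):
    return rng.normal(size=(n, k)) @ loadings.T + 0.5 * rng.normal(size=(n, p))

x_lab, x_unlab, x_new = draw_x(20), draw_x(200), draw_x(5)
y_lab = x_lab @ rng.normal(size=p) + rng.normal(size=20)

pred_labeled_only = pcr_fit_predict(x_lab, y_lab, x_new, x_lab, k)
pred_with_unlabeled = pcr_fit_predict(
    x_lab, y_lab, x_new, np.vstack([x_lab, x_unlab, x_new]), k)
print(pred_labeled_only)
print(pred_with_unlabeled)
```

The two sets of predictions differ because the estimated factor directions differ, which is exactly the sense in which the unlabeled design points enter the analysis.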

The question, and the general discussion of PCR
and empirical factor regression, can be embedded
in the broader theoretical context of (latent) factor
regression models. West (2003) formalized the development
of large-scale, latent factor models coupled
with regression on latent factors, and delineated a
comprehensive framework for predictive modeling
that was particularly motivated by problems involving
larger numbers of predictors, the "large p, small
n" paradigm. This elucidated the theory underlying
PCR and modeling using principal component projections
of high-dimensional covariates/predictors as
a limiting case of a broader class of regression models
where the predictors are latent variables. This
framework and theory also clarified and justified the
use of so-called g-priors (Zellner, 1986) for Bayesian
shrinkage regression, and defined novel classes of
multiple shrinkage methods that are significantly
beneficial in prediction problems through the ability
to induce differential shrinkage in different
factor-predictor dimensions. Importantly, the framework
trivially clarifies the issue of use of unlabeled data,
and how unlabeled samples enter into predictions
based on analysis of labeled data, in general. The
special limiting case of principal component regression
is one important benefit.

The following normal linear model serves as a specific
example to illustrate the more general principles of factor
regression models. A univariate response $y$ is to be
predicted based on a (high-dimensional) $p \times 1$
predictor variable $x$, and we have

$$y_i = a'\lambda_i + \varepsilon_i \quad \text{and} \quad x_i = B\lambda_i + \nu_i,$$

where $\varepsilon_i \sim N(0, \sigma^2)$, $\lambda_i \sim N(0, I)$
is a $k \times 1$ multivariate normal latent factor for each
$i$, $B$ is an uncertain $p \times k$ matrix of factor
loadings of $x$ on $\lambda$, $\nu_i \sim N(0, \Psi)$ is a
vector of idiosyncratic noise terms and $\Psi$ is an
uncertain diagonal variance matrix. Also, the
$\varepsilon_i$ and $\nu_i$ are conditionally (on all model
parameters) mutually independent over $i$.

4.2 Unlabeled Data in Factor Regression Models 

This framework is a key example of when unlabeled data
matter. Fundamentally, the outcomes $y$ to be predicted are
modeled as responses in regressions on latent variables
$\lambda$, and the observed concomitant $x$ variables are
related to $\lambda$, while $y$ and $x$ are conditionally
independent given $\lambda$. Thus the predictive relevance
of $x$ is indirect, through $\lambda$.

By marginalizing over $\lambda$ in the joint multivariate
normal distribution of $y$, $x$ and $\lambda$ implied by the
model specification, it becomes clear that we can identify
$p(y \mid x, \phi)$ as a normal linear regression of $y$ on
$x$, with regression parameter vector and residual variance
making up the parameter $\phi = \phi(a, \sigma, B, \Psi)$.
Also, the implied marginal distribution for $x$ is normal
with zero mean and variance matrix $\theta = BB' + \Psi$.
Thus, if $\{B, \Psi\}$ are known, then $\theta$ is known and
so the observed, unlabeled data $X^m$ has no influence
whatsoever in the problem of predicting a future $y_*$ given
data from either prospective or retrospective designs.
However, typically $\{B, \Psi\}$ are uncertain and need to
be estimated. In this setting:

• Unlabeled data $X^m$ provides information relevant to
estimation of the latent factor model parameters
$\{B, \Psi\}$, and hence of relevance to predicting future
$y_*$ values via the transfer of information through
inferences on the future $\lambda_*$ related to $x_*$.

• $\phi$ is dependent on aspects of $\theta$ indirectly
through their functional associations with the factor model
parameters, so that any relevant prior
$p(B, \Psi, a, \sigma)$ will induce dependencies between
$\phi$ and $\theta$.
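
The marginalization over the latent factors can be written out numerically. The following is a minimal sketch, assuming NumPy and known parameter values; the function name `implied_regression` and the toy numbers are illustrative and do not come from the source. Integrating out the latent factor gives Cov(x, y) = Ba and Var(x) = BB' + Ψ, so the implied regression vector is (BB' + Ψ)^{-1}Ba and the residual variance is σ² + a'a − a'B'(BB' + Ψ)^{-1}Ba.

```python
import numpy as np

def implied_regression(a, B, psi_diag, sigma2):
    """Regression of y on x implied by y = a'lambda + eps, x = B lambda + nu,
    after integrating out lambda ~ N(0, I). Returns (beta, residual variance)."""
    a = np.asarray(a, dtype=float)
    B = np.asarray(B, dtype=float)
    theta = B @ B.T + np.diag(np.asarray(psi_diag, dtype=float))  # Var(x)
    cov_xy = B @ a                                                # Cov(x, y)
    beta = np.linalg.solve(theta, cov_xy)                         # theta^{-1} B a
    resid_var = sigma2 + a @ a - cov_xy @ beta                    # Var(y | x)
    return beta, float(resid_var)

# Toy case with p = 1 and k = 1 (illustrative values only).
beta, resid_var = implied_regression(a=[1.0], B=[[1.0]], psi_diag=[1.0], sigma2=1.0)
```

Because the regression vector is a function of B and Ψ, anything that sharpens inference on those parameters, including unlabeled x values, feeds through to the prediction of y.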

4.3 Digit Classification Example 

The MNIST data set (Y. LeCun,
http://yann.lecun.com/exdb/mnist/) is a standard data set
used extensively in the machine learning community to
benchmark binary regression models. The data set contains
60,000 images of handwritten digits {0, 1, 2, . . . , 9},
where each image consists of p = 28 × 28 = 784 gray-scale
pixel intensities.



As an example, we consider what is generally re- 
garded as one of the most difficult pairwise compar- 
isons, that of discriminating a handwritten "6" from 
a "9." We frame this as a binary regression problem. 
The predictor space x is transformed via singular 
value decomposition of the initial design matrix of 
784 primary pixel values (after centering), and the 
first two factors are used for predictive discrimina- 
tion of unlabeled samples. 

The data set contains 5918 handwritten "6"s and 
5949 handwritten "9"s. Following Belkin, Niyogi and 
Sindhwani (2004), we take the first 400 observa- 
tions from each class as a training sample and use 
the remaining samples as test cases to be predicted. 
The standard MCMC analysis of the probit regres- 
sion model produces approximate posterior predic- 
tive probabilities of "6" versus "9" for each of the 
several thousand test samples, and we record empir- 
ical prediction error rates based on whether or not 
the predictive probability of the true digit (true la- 
bel) lies below or above 0.5. For the labeled/unlabeled 
evaluation, our analysis is most extreme: we ran- 
domly select just two "6"s and two "9"s to treat as 
labeled, the remaining 398 in each of the two classes 
being regarded as unlabeled. To give an initial indi- 
cation of the relevance of unlabeled data, Figure 2 
plots the projections of the full sets of training and 
test data onto the two factors (first two principal 
components) of the labeled and unlabeled data to- 
gether; the separation of digits is quite strong and 
clear for both the training and test data. Repeating 
the factorization and projection of the training data, 
but now using only four labeled samples (randomly 
selected with two of each digit), produces the graph 
in Figure 3(a); frames (b), (c) and (d) of Figure 3 
show similar plots for different random draws of the 
four samples treated as labeled. The relevance of 
unlabeled samples is quite evident from comparison 
of these plots with those based on the labeled and 
unlabeled data together. 

In the probit factor regression models using the 
first two principal components as predictors, and 
with a simple, standard normal/inverse gamma prior 
on regression parameters (West, 2003), repeated anal- 
ysis of labeled data alone in the above framework 
yields an average prediction error rate on the test 
samples of approximately 31.2%. Repeating this anal- 
ysis but now including the unlabeled data in defin- 
ing the empirical factors yields a semi-supervised av- 
erage error rate of approximately 9.5%. This gives 
some indication of the potential improvements in
raw predictive accuracy that may accrue from the
appropriate use of unlabeled data.
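
The contrast between empirical factors defined from the labeled samples alone and from labeled and unlabeled samples together can be sketched as follows. This is an illustration on synthetic Gaussian data, not the MNIST analysis: the helper `empirical_factors`, the dimension of 50 and the class shift of 6 are all assumptions made for the example.

```python
import numpy as np

def empirical_factors(X_fit, X_apply, k=2):
    """Project X_apply onto the first k principal directions of X_fit."""
    mean = X_fit.mean(axis=0)
    # Rows of Vt are the principal directions of the centered design matrix.
    _, _, Vt = np.linalg.svd(X_fit - mean, full_matrices=False)
    return (X_apply - mean) @ Vt[:k].T

rng = np.random.default_rng(0)
n_per_class, p = 200, 50
labels = np.repeat([0, 1], n_per_class)
shift = np.zeros(p)
shift[0] = 6.0                                   # classes differ along one axis
X = rng.normal(size=(2 * n_per_class, p)) + np.outer(labels, shift)

# Treat only two samples per class as labeled.
labeled_idx = np.array([0, 1, n_per_class, n_per_class + 1])

# Factors defined by the four labeled rows only.
factors_labeled_only = empirical_factors(X[labeled_idx], X, k=2)
# Factors defined by all rows, labeled and unlabeled together.
factors_all = empirical_factors(X, X, k=2)
```

Plotting the two sets of factor scores against the class labels gives the same kind of comparison as Figures 2 and 3: the factors computed from all rows use the full design to locate the discriminating direction, whereas four rows determine it only noisily.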

5. KERNEL REGRESSION FOR PREDICTION 
AND CLASSIFICATION 

5.1 Kernel Regression Models 

An interesting class of examples, which is central 
to the methodological interfaces of statistics and 
machine learning, arises in models based on ker- 
nel regression. Kernel and related smoothing spline 
methods have a long history in applied statistics 
and have seen a tremendous amount of development 
at the interfaces of machine learning and statistics 
in the last several years (Poggio and Girosi, 1990;
Wahba, 1990; Vapnik, 1998; Schölkopf and Smola,
2002; Shawe-Taylor and Cristianini, 2004; Liang
et al., 2007).

The context is nonparametric, nonlinear regression with
$y \in \mathbb{R}$, $x \in \mathbb{R}^k$, and a model of the
form

(6) $y = f(x) + \varepsilon,$

where $\varepsilon$ is a zero-mean noise term and $f$ is an
uncertain regression function. As an example, the class of
Bayesian radial basis (RB) models (Liang et al., 2007) deals
with questions of proper probability models, and the
resulting proper inference and predictive results that then
arise, for uncertain knots in a kernel model. This
framework, and other approaches, begin with the interest in
a representation of the form

(7) $f(x) = \int w(u) K(x, u)\, dG(u)$

for some weight function $w(u)$ over $k$-dimensional $u$,
and some specified kernel function $K(\cdot, \cdot)$. The
element $G(\cdot)$ is the unknown probability distribution
function for $x$. The key to the model is to note that, if
$G$ is discrete and puts masses $g_i$ at support points (or
"knots") $u_i$, then the expression for $f(\cdot)$ is simply

$f(x) = \sum_i w(u_i)\, g_i\, K(x, u_i),$

that is, a radial basis function representation. The
analysis of Liang et al. (2007) describes approximations to
a model in which uncertainty about $G$ is expressed using a
Dirichlet process prior (Ferguson, 1973; Escobar and West,
1995). One implication of such a model for $G$ is that,
since Dirichlet processes are discrete with probability 1,
the formal mathematical model for $f(x)$ is the sum above
with a countably infinite number of knots $u_i$. From the
methodological viewpoint, both labeled and unlabeled $x$
values provide information about $G$ directly. In fact, with
a sample of $n$ labeled and/or unlabeled $x$ values
$x_1, \ldots, x_n$ (whether from $X$, $X^m$ or some
combination of the two), this Dirichlet process model
implies that $f$ may be approximated by

(8) $f_n(x) = \sum_{i=1}^{n} w_{n,i} K(x, x_i),$

Fig. 2. MNIST handwritten digit data example, with samples of handwritten "6" (+) and "9" (◇). The axes are the first
two principal components computed from a singular value decomposition of the centered data, using the full data set of over
11,000 samples (5918 "6"s and 5949 "9"s). Scatter plotted on these empirical factors are (a) the training data of 800 samples
(400 of each digit), and (b) the remaining test data. The two factors evidently carry strongly discriminating information.

Fig. 3. MNIST handwritten digit data example, in a format similar to that of Figure 2. In these frames the two principal
components were evaluated on only four samples selected as labeled, two of each digit, and the scatter plots are of the training
data of 800 samples. The four labeled samples were randomly drawn from the training data set, and the four frames here
represent four different draws of the labeled samples. Though there is evidence of discriminatory information to distinguish
the handwritten "6"s (+) from the "9"s (◇), it is quite clear that discriminatory power will be very limited.



where $w_{n,i} \propto w(x_i)$. The key methodological
relevance of this approach is that this is true for all $n$,
providing consistency as sample size increases and
additional design points are observed. This leads to the
practical model in which each $y_*$ is linearly regressed on
the set of kernel predictors $\{K(x_*, x_j)\}_{j=1}^{n}$
based on whatever set of design points is observed. A
complete model now involves a prior distribution over the
induced regression coefficients $w_{n,i}$, and we note that
this explicitly depends on $n$ and the realized $x_j$.
Hence, in both the structure of the regression model and in
the requirements for a prior over coefficients, we see the
dependence on all values observed in the $x$ space; this is
therefore a perfect example of when, why and how unlabeled
$X^m$ data matters. In particular, we note that:

• $\theta = G(\cdot)$ so that $p(x \mid \theta)\,dx = dG(x)$;
the parameter is the full distribution function itself.

• Equations (6) and (7) show explicitly how
$p(y \mid x, \phi)$ depends intimately on $\theta = G$ as
defining the nonlinear kernel regression; in fact,
$\theta \subset \phi$ in this case. Thus prior and posterior
dependence of $\theta$ and $\phi$ is central to the model.

• As a result, unlabeled $X^m$ provides direct, immediate
and critically relevant information in predicting $y_*$.
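
The way unlabeled design points enter the regression structure itself can be made concrete with a small sketch. It assumes a Gaussian kernel and replaces the Bayesian prior of Liang et al. (2007) with a plain ridge penalty, purely for brevity; the function names, the bandwidth and the ridge value are illustrative assumptions, not the source's specification.

```python
import numpy as np

def gaussian_kernel(X1, X2, bandwidth=1.0):
    """K(x, u) = exp(-||x - u||^2 / (2 * bandwidth^2)) for all row pairs."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def fit_kernel_weights(X_lab, y_lab, knots, ridge=1e-2, bandwidth=1.0):
    """Ridge estimate of w in f_n(x) = sum_i w_i K(x, knot_i)."""
    Phi = gaussian_kernel(X_lab, knots, bandwidth)   # n_labeled x n_knots
    A = Phi.T @ Phi + ridge * np.eye(knots.shape[0])
    return np.linalg.solve(A, Phi.T @ y_lab)

def predict(X_new, knots, w, bandwidth=1.0):
    """Evaluate f_n at the rows of X_new."""
    return gaussian_kernel(X_new, knots, bandwidth) @ w

rng = np.random.default_rng(1)
X_lab = rng.normal(size=(4, 2))
y_lab = np.array([0.0, 0.0, 1.0, 1.0])
X_unl = rng.normal(size=(6, 2))

# Knots at labeled points only: four kernel predictors.
w_lab = fit_kernel_weights(X_lab, y_lab, X_lab)
# Knots at labeled and unlabeled points: ten kernel predictors.
knots_all = np.vstack([X_lab, X_unl])
w_all = fit_kernel_weights(X_lab, y_lab, knots_all)
pred_all = predict(X_unl, knots_all, w_all)
```

The number of regression coefficients equals the number of observed design points, so adding unlabeled points changes the model being fitted, which is why a prior that scales coherently with n is needed.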

This specific example of a kernel regression model 
derived within a coherent probabilistic framework, 
taken from Liang et al. (2007), is presented for its 
simplicity and also because it represents a fully spec- 
ified probabilistic model in which the kernel weights 
Wn,i are related coherently as sample sizes change. 
Some additional connections and related kernel re- 
gression formulations are now mentioned. 

5.2 Relation to Machine Learning Kernel 
Regression Algorithms 

Other constructions of kernel regression models, 
including those utilizing Gaussian processes and 
spline smoothing, non-Bayesian uses of radial basis
functions and others (Poggio and Girosi, 1990;
Wahba, 1990; Schölkopf and Smola, 2002; Vapnik,
1998; Shawe-Taylor and Cristianini, 2004), exhibit
the same structure and consequent dependence on 
unlabeled data. One interesting connection with re- 
cent theoretical developments in machine learning 
approaches arises by noting that the central model 
of (7) also corresponds to the solution of the non- 
linear manifold regularization formulation of Belkin, 
Niyogi and Sindhwani (2004). This approach, moti- 
vated by geometric arguments, is an optimization 
algorithm that minimizes 



$$\arg\min_{f \in \mathcal{H}_K} \left[ \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i) + \gamma_A \|f\|_K^2 + \gamma_I \|f\|_I^2 \right],$$

where $\{(y_i, x_i)\}_{i=1}^{n}$ are the labeled data,
$\mathcal{H}_K$ is a reproducing kernel Hilbert space
(RKHS), $V(f(x), y)$ is a loss function, $\|f\|_K$ is the
RKHS norm, $\gamma_A, \gamma_I$ are regularization
parameters and $\|f\|_I$ is a norm that reflects the
smoothness of the function on the marginal $p(x)$. If the
marginal is concentrated on a manifold,
$x \in \mathcal{M} \subset \mathbb{R}^k$, then a natural
choice for $\|f\|_I$ is the Laplacian on the manifold. The
marginal $p(x)$ is generally unknown; with unlabeled data
$X^m$ from the marginal, the Laplacian on the manifold may
be approximated by a Laplacian on the graph defined by the
observed data (labeled and unlabeled)

$$f_n(x) = \arg\min_{f \in \mathcal{H}_K} \left[ \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i) + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(n + n_m)^2}\, \mathbf{f}^{T} L \mathbf{f} \right],$$

where $L$ is the graph Laplacian on all the data (given a
weight matrix on the graph) and
$\mathbf{f} = \{f(x_1), \ldots, f(x_n), f(x_1^m), \ldots, f(x_{n_m}^m)\}$.
The above optimization is achieved by

$$f_n(x) = \sum_{i=1}^{n} w_{n+n_m,\,i}\, K(x, x_i) + \sum_{i=1}^{n_m} w_{n+n_m,\,n+i}\, K(x, x_i^m),$$

which takes the same form as (8). This formulation as an
optimization problem from a statistical machine learning
viewpoint generates precisely the same functional form of
the model as that derived from a nonparametric regression in
the Bayesian framework above, and the consequences for the
use of unlabeled data in model formulation are the same.
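
For squared-error loss, the graph-regularized problem has a closed-form solution, which makes the role of the unlabeled points easy to see. The sketch below assumes V is squared error, a Gaussian kernel, and graph weights taken from the same Gaussian kernel; these choices, the function names and the parameter values are assumptions for illustration, whereas the formulation in the text allows a general loss and weight matrix. Setting the gradient of the objective to zero gives coefficients w = (JK + γ_A n I + γ_I n/(n + n_m)² LK)^{-1} y, where J is diagonal with ones for labeled points and zeros otherwise, and y is padded with zeros for unlabeled points.

```python
import numpy as np

def rbf(X1, X2, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of X1 and X2."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def laplacian_rls(X_lab, y_lab, X_unl, gamma_A=1e-2, gamma_I=1.0, bandwidth=1.0):
    """Squared-loss solution of the graph-Laplacian regularized problem.
    Returns all design points and one kernel coefficient per point."""
    n, n_m = len(X_lab), len(X_unl)
    X_all = np.vstack([X_lab, X_unl])
    K = rbf(X_all, X_all, bandwidth)
    W = K.copy()                      # graph weights (assumed: same kernel)
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W    # unnormalized graph Laplacian
    J = np.diag(np.r_[np.ones(n), np.zeros(n_m)])
    y_pad = np.r_[y_lab, np.zeros(n_m)]
    A = J @ K + gamma_A * n * np.eye(n + n_m) \
        + (gamma_I * n / (n + n_m) ** 2) * (L @ K)
    return X_all, np.linalg.solve(A, y_pad)

rng = np.random.default_rng(2)
centers = np.array([[-2.0, 0.0], [2.0, 0.0]])
X_lab = centers.copy()                 # one labeled point per cluster
y_lab = np.array([-1.0, 1.0])
X_unl = np.vstack([c + 0.3 * rng.normal(size=(10, 2)) for c in centers])

X_all, w = laplacian_rls(X_lab, y_lab, X_unl)
f_unl = rbf(X_unl, X_all) @ w          # fitted values at the unlabeled points
```

The fitted function has one kernel term per labeled and per unlabeled point, matching the expansion given above.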

5.3 Illuminating the Potential Impact of 
Unlabeled Data 

A simple but illuminating synthetic example provides an
initial illustration. A data set of 50 points
$\{(x_i, y_i), i = 1{:}50\}$ is plotted in Figure 4(b); here
$x_i \in \mathbb{R}^2$ and $y_i = 0/1$. This data set can be
easily classified according to y = 0 versus y = 1 by a
Gaussian kernel model, and we fit such a model
using the Bayesian model completion — in terms of 
prior specification for the kernel weights and ob- 
servational variance parameters — and the resulting 
MCMC method for model fitting as described in 
Liang et al. (2007). Though the details of the prior 
specification and computation are not central here, 
we note that the model involves use of a generalized 
shrinkage prior, termed generalized g-prior by West 
(2003), on the kernel regression coefficients. This is a 
method of importance when dealing with large num- 
bers of regression parameters, and its use in these 
kernel models where the number of regression pa- 
rameters exceeds the number of labeled observations 
is particularly apt. 

The analysis leads to the computation of posterior
predictive probabilities for y = 1 versus y = 0 at any
chosen new x value, that is, the class predictions based on
any data set. Using the fully labeled 50 data points for
such an analysis yields results displayed and described in
Figure 4(a).

Fig. 4. Kernel regression example using synthetic data. Frame (b) displays a scatter plot of the 50 observations on the $x_1$
versus $x_2$ axes, with cases $y_i = 1$ as "+" (blue) and $y_i = 0$ as "o" (red). The binary kernel regression model analysis of the
full set of 50 labeled observations produces approximate posterior predictive probabilities $\Pr(y = 1 \mid x, D)$ at any point x in the
plane; the green "*" points in frame (b) are points at which $\Pr(y = 1 \mid x, D) = 0.5$, that is, represent points on the separating
contour. Frame (a) displays a color image of the contours of $\Pr(y = 1 \mid x, D)$ as x varies; red corresponds to the conditional
probability being near 0 and blue near 1.

Fig. 5. Kernel regression example with displays as in Figure 4: (a) using only four selected data points; (b) using the same
four selected data points but also including the unlabeled $X^m$ data.






F. LIANG, S. MUKHERJEE AND M. WEST 







Fig. 6. Kernel regression example: (a) using eight labeled data points; (b) using the eight labeled data points together with 
the remaining unlabeled samples. Notice that the use of the 42 unlabeled samples is sufficient to produce predictive probability 
contours that are very similar to those using the full set of labeled observations, as in Figure 4, though with evident and 
justifiably greater uncertainty, even though the y values are labeled on only eight data points. 



The analysis is repeated using only four labeled points,
two each with y = 0 and y = 1; the four randomly selected
points are marked in Figure 5. The resulting class
prediction contours and summaries of predictions for the 46
unlabeled points are then computed in two separate analyses:
(i) using only the labeled data, just the four points; and
(ii) using the labeled and unlabeled data. Figure 5 presents
the results of these two analyses. This exercise was
repeated using a total of eight labeled points, resulting in
the displays in Figure 6. From the figures the major impact
of unlabeled data is clearly apparent, and its relevance
with very small numbers of labeled samples highlighted. We
also see that the semisupervised analysis using only eight
labeled samples results in predictions that are very similar
to those obtained if all 50 samples were labeled. Unlabeled
data can dramatically impact upon and improve prediction
accuracy.

5.4 Kernel Regression for Cancer Classification 
Using Genomic Data 

A substantive example involves analysis of a gene 
expression data set consisting of DNA microarray 
expression profiles from 190 tissue samples repre- 
senting a variety of different primary tumors (breast, 
prostate, lung, lymphoma, etc.) and 90 noncancer- 
ous, "normal" samples from the corresponding tis- 
sue of origin (Ramaswamy et al., 2001; Mukher- 
jee et al., 2003). Following standard processes of 
data normalization and screening for genes show- 
ing nontrivial variation, the data analyzed consists 
of p = 2800 gene expression variables, or "genes," 
on the set of 280 samples. The analysis setup aims to use
the gene expression data as predictors (x) in a binary
kernel regression model with outcome y = 1 representing
"cancer," that is, any of the cancer types, and y = 0
representing normals. As in the synthetic example above, the
analysis uses a Gaussian kernel model fitted using Bayesian
shrinkage priors, as in Liang et al. (2007). Our interest is
to compare predictions of cancer versus normal under this
given model and prior specification applied to differing
selections of data, selections that allow us to examine the
impact of unlabeled data.

Fig. 7. Results from predictions of cancer versus normal tissue (y = 0, 1) based on gene expression data (x), showing empirical
prediction error rates from the analysis ignoring the unlabeled data (solid line) compared to those from the analysis including
the unlabeled data (dashed line). These results were developed by repeating the analyses with varying percentages of the data
(horizontal axis) randomly designated as unlabeled. The selection of unlabeled cases and model analysis was rerun 50 times
for each chosen unlabeled percentage. The graph represents the average prediction error in predicting the true status (y = 1
versus y = 0 with predictive probability thresholded at 0.5). The uniform improvement in empirical predictive accuracy when
including the unlabeled data is clear.

We do this as in the synthetic example above — 
randomly selecting a fraction of the data to be re- 
garded as unlabeled, fitting the model and then pre- 
dicting the status (cancer versus normal) of the se- 
lected unlabeled cases in terms of posterior predic- 
tive probabilities. We repeat this analysis twice — 
first, using only the labeled data; second, using both 
the labeled and unlabeled data — and are then able 
to compare predictions between the two analyses to 
assess the changes due to use of the unlabeled data. 
For a given fraction of unlabeled data, we repeated 
this 50 times, each time randomly selecting the cases 
to be labeled/unlabeled, and computing the average 
(across the 50 repeats) empirical prediction error 
rate in classifying the unlabeled cases. The prediction of
an unlabeled sample is regarded as "correct" if the
predictive probability of the true state (cancer or normal)
exceeds 0.5. Figure 7 summarizes the
resulting empirical error rates for a series of such 
analyses in which we progressively increased the per- 
centage of labeled data from 10% to 90%. The figure 
clearly shows the differences between analysis using 
only labeled data and that using labeled and unla- 
beled data. 
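
The evaluation protocol described above (random designation of unlabeled cases, prediction of those cases, a 0.5 threshold on the predictive probability of the true state, and averaging over repeats) can be sketched generically. The function name `average_error_rate` and the `fit_predict` callback are hypothetical; the callback stands in for whatever model is being assessed and must return Pr(y = 1) for each unlabeled case. The toy data and threshold rule are placeholders, not the kernel model or the gene expression data.

```python
import numpy as np

def average_error_rate(X, y, fit_predict, unlabeled_fraction,
                       n_repeats=50, seed=0):
    """Average empirical error on cases randomly designated as unlabeled.
    A prediction counts as correct only if the predictive probability of
    the true state exceeds 0.5."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_unl = int(round(unlabeled_fraction * n))
    errors = []
    for _ in range(n_repeats):
        perm = rng.permutation(n)
        unl, lab = perm[:n_unl], perm[n_unl:]
        prob_one = np.asarray(fit_predict(X[lab], y[lab], X[unl]), dtype=float)
        prob_true = np.where(y[unl] == 1, prob_one, 1.0 - prob_one)
        errors.append(np.mean(prob_true <= 0.5))
    return float(np.mean(errors))

# Toy one-dimensional data: class 0 at x = 0, class 1 at x = 10.
X = np.r_[np.zeros(10), np.full(10, 10.0)].reshape(-1, 1)
y = np.r_[np.zeros(10), np.ones(10)].astype(int)

def threshold_rule(X_lab, y_lab, X_unl):
    """Placeholder predictor: cut at the mean of the labeled x values."""
    return (X_unl[:, 0] > X_lab[:, 0].mean()).astype(float)

err = average_error_rate(X, y, threshold_rule, unlabeled_fraction=0.5)
```

Running the same function twice, once with a `fit_predict` that ignores the unlabeled x values and once with one that uses them, gives the two curves compared in Figure 7.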

Additional insight into the impact of including un- 
labeled samples is given in Figure 8. From one anal- 
ysis with 80% of the data unlabeled, we select 10 
each of the cancer and normal samples that were 
unlabeled in the analysis, and graph the estimated 
predictive probabilities of cancer versus normal with 
approximate 95% credible interval. This shows the 
impact of the unlabeled x data on the predictions, 
in terms of the impact on estimates of prediction 
uncertainty as well as empirical accuracy. 

6. SUMMARY COMMENTS 

Beginning with an articulation of the basic sampling and
design specifications underlying statistical formulations of
prediction problems, we have delineated the conceptual and
theoretical issues underlying the use and relevance, or
irrelevance, of unlabeled data in classification and
prediction problems. This, coupled with a series of examples
in central statistical modeling contexts, and empirical
illustrations and evaluations in two substantive data
analyses, provides an overview and synthesis of the ideas
underlying the emerging methodology of semisupervised
learning in the machine learning and statistics literatures.

Graphical model representations of the joint sampling model
context aid in this interpretation. The relevance, or
otherwise, of the unlabeled $X^m$ data can be deduced
essentially by inspection of the implied (undirected)
graphical representation of any full model structure. For
example, the full distribution assuming joint sampling, and
in cases for which $p(\phi, \theta) = p(\phi)p(\theta)$, is
illustrated in graphical terms in Figure 9. The joint
density exhibited here is

$$p(y_*, x_*, Y, X, X^m, \phi, \theta) = p(y_* \mid x_*, \phi)\, p(Y \mid X, \phi)\, p(X \mid \theta)\, p(x_* \mid \theta)\, p(X^m \mid \theta)\, p(\phi)\, p(\theta).$$

Fig. 8. Cancer versus normal predictions using kernel model. The figure displays estimated predictive probabilities of cancer
versus normal for 10 cancers (*) and 10 normal tissues (o) that were unlabeled in the data analysis. This analysis involved only
20% of the data being labeled. The frames also provide estimated 95% credible intervals associated with each of the predictions.
This shows the impact of the unlabeled x data on the predictions, in terms of the impact on estimates of prediction uncertainty
as well as empirical accuracy.

Fig. 9. Graphical models of the basic structure relevant to understanding the role of unlabeled data in predictive modeling.
The figure shows the directed (acyclic) graph (a) and undirected graph (b) of the joint distribution of data and parameters in
cases of independence of $\phi, \theta$. In contrast, if $\phi, \theta$ are dependent, then (a) would have an edge between $\phi$ and $\theta$, and (b) would
be a fully connected graph.

Figure 9(a) is a directed acyclic graph of the joint
distribution structured in terms of composition of sampling
distributions. Figure 9(b) displays the corresponding
undirected graph in which the lack of an edge between $X^m$
and $y_*$ indicates conditional independence given all other
quantities, hence the irrelevance to prediction of the
unlabeled data in this case. In contrast, were
$\phi, \theta$ to be a priori dependent, then the five nodes
of the undirected graph would be fully connected, exhibiting
the relevance of the unlabeled data to prediction of $y_*$.
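
The conditional independence read off the graph can also be checked algebraically. The short derivation below is a sketch that assumes joint sampling, so that $Y$ depends on $\phi$ only through $p(Y \mid X, \phi)$ and all $x$ values depend only on $\theta$, together with the independent prior $p(\phi, \theta) = p(\phi)p(\theta)$. Under these assumptions the terms involving $\theta$ integrate to a constant in $\phi$, and $X^m$ drops out of the predictive distribution.

```latex
\begin{align*}
p(\phi \mid Y, X, X^m, x_*)
  &\propto p(\phi)\, p(Y \mid X, \phi)
     \int p(\theta)\, p(X \mid \theta)\, p(x_* \mid \theta)\,
          p(X^m \mid \theta)\, d\theta \\
  &\propto p(\phi)\, p(Y \mid X, \phi), \\[4pt]
p(y_* \mid x_*, Y, X, X^m)
  &= \int p(y_* \mid x_*, \phi)\, p(\phi \mid Y, X)\, d\phi .
\end{align*}
```

If instead the prior does not factorize, the integral over $\theta$ depends on $\phi$ through $p(\theta \mid \phi)$, and the posterior for $\phi$, and hence the prediction of $y_*$, depends on $X^m$.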

In addition to clarifying and exemplifying the struc- 
ture of models and the prediction problem with un- 
labeled data, one aim of this work has been to review 
the area to provide a link across the mainstream sta- 
tistical and machine learning communities. We hope 
that this will entice more statistical researchers into 
a very active, productive and exciting research mi- 
lieu, while also founding the discussion in venera- 
ble, simple and unambiguous terms arising from the 
direct and classical probabilistic formulation. This 
view directly, we believe, addresses and answers the 
questions of "when, why and how" unlabeled data 
help in predictive modeling.